ABSTRACT

Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy that does not consider the temporal order of frames. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves the best performance reported to date: 91.3% on UCF-101 and 83.5% on CCV.


Categories and Subject Descriptors 

H.3.1 [Information Storage and Retrieval]: Content 
Analysis and Indexing —Indexing methods; I.5.2 [Pattern
Recognition]: Design Methodology —Classifier design and 
evaluation 
1. INTRODUCTION

Video classification based on content such as human actions or complex events is a challenging task that has been extensively studied in the research community. Significant progress has been achieved in recent years by designing various features that are expected to be robust to intra-class variations and discriminative enough to separate different classes. For example, one can use traditional image-based features such as SIFT to capture the static spatial information in videos. In addition to static frame-based visual features, motion is a very important clue for video classification, as most classes that involve object movement, such as human actions, require motion information to be reliably recognized. For this, a very popular feature is the dense trajectories, which track densely sampled local frame patches over time and compute several traditional features based on the trajectories.

In contrast to hand-crafted features, there is a growing trend of learning robust feature representations from raw data with deep neural networks. Among the many existing network structures, Convolutional Neural Networks (CNN) have demonstrated great success on various tasks, including image classification, image-based object localization, and speech recognition. For video classification, Ji et al. and Karpathy et al. extended the CNN to work on the temporal dimension by stacking frames over time. Recently, Simonyan et al. proposed a two-stream CNN approach [33], which uses two CNNs on static frames and optical flows respectively to capture the spatial and the motion information. It focuses only on short-term motion, as the optical flows are computed in very short time windows. With this approach, performance similar to or slightly better than that of hand-crafted features has been reported.

These existing works, however, are not able to model the long-term temporal clues in the videos. As mentioned above, the two-stream CNN uses stacked optical flows computed in short time windows as inputs, and the order of the optical flows is fully discarded in the learning process (cf. Section 3.1). This is not sufficient for video classification, as many complex contents can be better identified by considering the temporal order of short-term actions. Take the “birthday” event as an example: it usually involves several sequential actions, such as “making a wish”, “blowing out candles” and “eating cakes”.

To address the above limitation, this paper proposes a hybrid deep learning framework for video classification, which is able to harness not only the spatial and short-term motion features, but also the long-term temporal clues. In order to leverage the temporal information, we adopt a Recurrent Neural Network (RNN) model called Long Short Term Memory (LSTM), which maps the input sequences to outputs using a sequence of hidden states and incorporates memory units that enable the network to learn when to forget previous hidden states and when to update hidden states with new information. In addition, many approaches fuse multiple features in a very “shallow” manner, either by concatenating the features before classification or by averaging the predictions of classifiers trained separately on different features. In this work we integrate the spatial and the short-term motion features in a deep neural network with carefully designed regularizations to explore feature correlations. This method can perform video classification within the same network, and further combining its outputs with the predictions of the LSTMs can lead to very competitive classification performance.

Figure 1: An overview of the proposed hybrid deep learning framework for video classification. Given an input video, two types of features are extracted using the CNN from spatial frames and short-term stacked motion optical flows respectively. The features are separately fed into two sets of LSTM networks for long-term temporal modeling (Left and Right). In addition, we also employ a regularized feature fusion network to perform video-level feature fusion and classification (Middle). The outputs of the sequence-based LSTM and the video-level feature fusion network are combined to generate the final prediction. See the text for further discussion.

Figure 1 gives an overview of the proposed framework. Spatial and short-term motion features are first extracted by the two-stream CNN approach [33], and then input into the LSTM for long-term temporal modeling. Average pooling is adopted to generate video-level spatial and motion features, which are fused by the regularized feature fusion network. After that, outputs of the sequence-based LSTM and the video-level feature fusion network are combined as the final predictions. Notice that, in contrast to the current framework, one may alternatively train a fusion network to combine the frame-level spatial and motion features first and then use a single set of LSTMs for temporal modeling. However, in our experiments we have observed worse results using this strategy. The main reason is that learning dimension-wise feature correlations in the fusion network requires strong and reliable supervision, but we only have video-level class labels, which are not necessarily related to the semantics of every frame. In other words, the imprecise frame-level labels propagated from the video annotations are too noisy to learn a good fusion network. The main contributions of this work are summarized as follows:

• We propose an end-to-end hybrid deep learning framework for video classification, which can model not only the short-term spatial-motion patterns but also the long-term temporal clues with variable-length video sequences as inputs.

• We adopt the LSTM to model long-term temporal clues on top of both the spatial and the short-term motion features. We show that both features work well with the LSTM, and that the LSTM-based classifiers are very complementary to the traditional classifiers that do not consider the temporal order of frames.

• We fuse the spatial and the motion features in a regularized feature fusion network that can explore feature correlations and perform classification. The network is computationally efficient in both training and testing.

• Through an extensive set of experiments, we demonstrate that our proposed framework outperforms several alternative methods with clear margins. On the well-known UCF-101 and CCV benchmarks, we attain the best performance to date.

The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 describes the proposed hybrid deep learning framework in detail. Experimental results and comparisons are discussed in Section 4, followed by conclusions in Section 5.

2. RELATED WORKS 

Video classification has been a longstanding research topic in multimedia and computer vision. Successful classification systems rely heavily on the extracted video features, and hence most existing works have focused on designing robust and discriminative features. Many video representations were motivated by advances in the image domain, which can be extended to utilize the temporal dimension of the video data. For instance, Laptev extended the 2D Harris corner detector into 3D space to find space-time interest points. Kläser et al. proposed HOG3D by extending the idea of integral images for fast descriptor computation. Wang et al. reported that dense sampling at regular positions in space and time outperforms the detected sparse interest points on video classification tasks. Partly inspired by this finding, they further proposed the dense trajectory features, which densely sample local patches from each frame at different scales and then track them in a dense optical flow field over time. This method has demonstrated very competitive results on major benchmark datasets. In addition, further improvements may be achieved by using advanced feature encoding methods like the Fisher Vectors or by adopting feature normalization strategies, such as RootSIFT and Power Norm. Note that these spatial-temporal video descriptors only capture local motion patterns within a very short period, and popular descriptor quantization methods like the bag-of-words entirely destroy the temporal order information of the descriptors.

To explore the long-term temporal clues, graphical models have been widely used, such as hidden Markov models (HMM), Bayesian Networks (BN), Conditional Random Fields (CRF), etc. For instance, Li et al. proposed to replace the hidden states in HMMs with visualizable salient poses estimated by Gaussian Mixture Models, and Tang et al. introduced latent variables over video frames to discover the most discriminative states of an event based on a variable-duration HMM. Zeng et al. exploited multiple types of domain knowledge to guide the learning of a Dynamic BN for action recognition. Instead of using directed graphical models like the HMM and BN, undirected graphical models have also been adopted. Vail et al. employed the CRF for activity recognition. Wang et al. proposed a max-margin hidden CRF for action recognition in videos, where a human action is modeled as a global root template and a constellation of several “parts”.

Many related works have investigated the fusion of multiple features, which is often effective for improving classification performance. The most straightforward and popular ways are early fusion and late fusion. Generally, early fusion refers to fusion at the feature level, such as feature concatenation or linear combination of kernels of individual features. For example, Zhang et al. computed non-linear kernels for each feature separately, and then fused the kernels for model training. The fusion weights can be manually set or automatically estimated by multiple kernel learning (MKL). For the late fusion methods, independent classifiers are first trained using each feature separately, and outputs of the classifiers are then combined. Ye et al. proposed a robust late fusion approach to fuse multiple classification outputs by seeking a shared low-rank latent matrix, assuming that noise may exist in the predictions of some classifiers, which can possibly be removed by using the low-rank matrix.
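The distinction between the two fusion styles can be illustrated with a minimal Python sketch. The function names and the equal default weighting are assumptions made for illustration; they are not taken from any of the cited methods.

```python
def early_fusion(feature_a, feature_b):
    """Feature-level fusion by simple concatenation."""
    return list(feature_a) + list(feature_b)


def late_fusion(scores_a, scores_b, weight_a=0.5):
    """Score-level fusion: weighted average of two classifiers' outputs."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


# Early fusion yields one longer feature vector for a single classifier.
fused_feature = early_fusion([1.0, 2.0], [3.0])

# Late fusion combines per-class scores of two separately trained classifiers.
fused_scores = late_fusion([0.8, 0.2], [0.4, 0.6])
print(fused_feature, fused_scores)
```

Neither function looks at how individual feature dimensions relate to each other, which is the limitation discussed next.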


Both early and late fusion fail to explore the correlations shared by the features and hence are not ideal for video classification. In this paper we employ a regularized neural network tailored for feature fusion and classification, which can automatically learn dimension-wise feature correlations. Several studies are related. In [12], the authors proposed to construct an audio-visual joint codebook for video classification, in order to discover and model the audio-visual feature correlations. There are also studies on using neural networks for feature fusion: deep Boltzmann machines have been employed to learn a fused representation of images and texts, and a deep denoising auto-encoder has been used for cross-modality and shared representation learning. Very recently, Wu et al. [47] presented an approach using regularizations in neural networks to exploit feature and class relationships. The fusion approach in this work differs in two respects. First, instead of using the traditional hand-crafted features as inputs, we adopt CNN features trained from both static frames and motion optical flows. Second and very importantly, the formulation in the regularized feature fusion network has a much lower complexity compared with that of [47].

Researchers have attempted to apply the RNN to model the long-term temporal information in videos. Venugopalan et al. proposed to translate videos to textual sentences with the LSTM through transferring knowledge from image description tasks. Ranzato et al. introduced a generative model with the RNN to predict motions in videos. In the context of video classification, Donahue et al. adopted the LSTM to model temporal information and Srivastava et al. designed an encoder-decoder RNN architecture to learn feature representations in an unsupervised manner. To model motion information, both works adopted optical flow “images” between nearby frames as the inputs of the LSTM. In contrast, our approach adopts stacked optical flows. Stacked flows over a short time period can better reflect local motion patterns, which we found to produce better results. In addition, our framework incorporates video-level predictions with the feature fusion network for significantly improved performance, which was not considered in these existing works.

Besides the above discussions of related studies on feature fusion and temporal modeling with the RNN, several representative CNN-based approaches for video classification should also be covered here. The image-based CNN features have recently been directly adopted for video classification, extracted using off-the-shelf models trained on large-scale image datasets like the ImageNet. For instance, Jain et al. performed action recognition using the SVM classifier with such CNN features and achieved top results in the 2014 THUMOS action recognition challenge. A few works have also tried to extend the CNN to exploit the motion information in videos. Ji et al. and Karpathy et al. extended the CNN by stacking visual frames in fixed-size time windows and using spatial-temporal convolutions for video classification. Differently, the two-stream CNN approach by Simonyan et al. [33] applies the CNN separately on visual frames (the spatial stream) and stacked optical flows (the motion stream). This approach has been found to be more effective and is adopted as the basis of our proposed framework. However, as discussed in Section 1, all these approaches [18, 33] can only model short-term motion, not the long-term temporal clues.

3. METHODOLOGY 

In this section, we describe the key components of the proposed hybrid deep learning framework shown in Figure 1, including the CNN-based spatial and short-term motion features, the long-term LSTM-based temporal modeling, and the video-level regularized feature fusion network.

3.1 Spatial and Motion CNN Features 

Conventional CNN architectures take images as the inputs and consist of alternating convolutional and pooling layers, which are further topped by a few fully-connected (FC) layers. To extract the spatial and the short-term motion features, we adopt the recent two-stream CNN approach [33]. Instead of using stacked frames in short time windows like [13], this approach decouples the videos into spatial and motion streams modeled by two CNNs separately. Figure 2 gives an overview. The spatial stream is built on sampled individual frames, which is exactly the same as the CNN-based image classification pipeline and is suitable for capturing the static information in videos like scene backgrounds and basic objects. The motion counterpart operates on top of stacked optical flows. Specifically, optical flows (displacement vector fields) are computed between each pair of adjacent frames, and the horizontal and vertical components of the displacement vectors form two optical flow images. Instead of using each individual flow image as the input of the CNN, it was reported that stacked optical flows over a time window are better due to their ability to model the short-term motion. In other words, the input of the motion stream CNN is a 2L-channel stacked optical flow image, where L is the number of frames in the window. The two CNNs produce classification scores separately using a softmax layer and the scores are linearly combined as the final prediction.

Like many existing works on visual classification using the CNN features, we adopt the output of the first FC layer of the two CNNs as the spatial and the short-term motion features.
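To make the 2L-channel input concrete, the following is a minimal pure-Python sketch of stacking L flow fields. The function name and the interleaved channel order (horizontal, vertical, horizontal, ...) are assumptions for illustration; the text only fixes the channel count.

```python
def stack_optical_flows(flows):
    """Stack L optical-flow fields into one 2L-channel input.

    `flows` is a list of L (dx, dy) pairs, where dx and dy are H x W nested
    lists holding the horizontal and vertical displacement components.
    The interleaved channel order is an assumption of this sketch.
    """
    channels = []
    for dx, dy in flows:
        channels.append(dx)  # horizontal component image
        channels.append(dy)  # vertical component image
    return channels


# Toy example: L = 3 flow fields of size 2 x 2 with constant displacements.
L, H, W = 3, 2, 2
flows = [([[float(t)] * W for _ in range(H)],
          [[-float(t)] * W for _ in range(H)]) for t in range(L)]
stacked = stack_optical_flows(flows)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # channels, H, W
```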

3.2 Temporal Modeling with LSTM 

During the training process of the spatial and the motion stream CNNs, each sweep through the network takes one visual frame or one stacked optical flow image, and the temporal order of the frames is fully discarded. To model the long-term dynamic information in video sequences, we leverage the LSTM model, which has been successfully applied to speech recognition, image captioning, etc. LSTM is a type of RNN with controllable memory units and is effective in many long-range sequential modeling tasks without suffering from the “vanishing gradients” effect like traditional RNNs. Generally, LSTM recursively maps the input representations at the current time step to output labels via a sequence of hidden states, and thus the learning process of LSTM should proceed in a sequential manner (from left to right in the two sets of LSTM in Figure 1). Finally, we can obtain a prediction score at each time step with a softmax transformation using the hidden states from the last layer of the LSTM.

Footnote: Note that the authors of the original paper [33] used the name “temporal stream”. We call it “motion stream” as it only captures short-term motion, which is different from the long-term temporal modeling in our proposed framework.

Figure 2: The framework of the two-stream CNN. Outputs of the first fully-connected layer in the two CNNs (outlined) are used as the spatial and the short-term motion features for further processing.

Figure 3: The structure of an LSTM unit.

More formally, given a sequence of feature representations $(x_1, x_2, \ldots, x_T)$, an LSTM maps the inputs to an output sequence $(y_1, y_2, \ldots, y_T)$ by computing activations of the units in the network with the following equations recursively from $t = 1$ to $t = T$:

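$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \\
h_t &= o_t \tanh(c_t),
\end{aligned}
$$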

where $x_t$ and $h_t$ are the input and hidden vectors, with the subscript $t$ denoting the $t$-th time step; $i_t$, $f_t$, $c_t$, $o_t$ are respectively the activation vectors of the input gate, forget gate, memory cell and output gate; $W_{\alpha\beta}$ is the weight matrix between $\alpha$ and $\beta$ (e.g., $W_{xi}$ is the weight matrix from the input $x_t$ to the input gate $i_t$); $b_\alpha$ is the bias term of $\alpha$; and $\sigma$ is the sigmoid function defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. Figure 3 visualizes the structure of an LSTM unit.
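The gate equations can be traced with a small pure-Python sketch of a single time step. This is an illustration only: the parameter names in the dictionary are hypothetical, and the cell-to-gate (peephole) weights are assumed to be diagonal and stored as vectors, which the text does not specify.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations in the text.

    `p` maps hypothetical parameter names to values. W_x* and W_h* are
    matrices; the peephole weights w_c* are assumed diagonal (vectors).
    """
    def pre(gate):
        # Shared part of every gate: W_x* x_t + W_h* h_{t-1} + b_*.
        wx = matvec(p["W_x" + gate], x_t)
        wh = matvec(p["W_h" + gate], h_prev)
        return [a + b + c for a, b, c in zip(wx, wh, p["b_" + gate])]

    i_t = [sigmoid(a + w * c) for a, w, c in zip(pre("i"), p["w_ci"], c_prev)]
    f_t = [sigmoid(a + w * c) for a, w, c in zip(pre("f"), p["w_cf"], c_prev)]
    c_t = [f * c + i * math.tanh(a)
           for f, c, i, a in zip(f_t, c_prev, i_t, pre("c"))]
    # The output gate uses the updated cell state c_t, not c_{t-1}.
    o_t = [sigmoid(a + w * c) for a, w, c in zip(pre("o"), p["w_co"], c_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hid):
    """All-zero parameters, used only to make the toy run easy to check."""
    p = {}
    for g in "ifco":
        p["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + g] = [0.0] * n_hid
    for g in "ifo":
        p["w_c" + g] = [0.0] * n_hid
    return p


p = zero_params(3, 2)
h, c = [0.0, 0.0], [1.0, -1.0]
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:  # a sequence of T = 2 inputs
    h, c = lstm_step(x_t, h, c, p)
print(h, c)
```

With all-zero weights every gate evaluates to 0.5, so the cell state is halved at each step; this makes the forget-gate term in the third equation easy to verify by hand.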

The core idea behind the LSTM model is a built-in memory cell that stores information over time to explore long-range dynamics, with non-linear gate units governing the information flow into and out of the cell. As we can see from the above equations, the current frame $x_t$ and the previous hidden states $h_{t-1}$ are used as inputs of four parts at the $t$-th time step. The memory cell aggregates information from two sources: the previous cell memory unit $c_{t-1}$ multiplied by the activation of the forget gate $f_t$, and the squashed inputs regulated with the input gate's activation $i_t$. This combination enables the LSTM to learn to forget information from previous states or consider new information. In addition, the output gate $o_t$ controls how much information from the memory cell is passed to the hidden states $h_t$ for the following time step. With the explicitly controllable memory units and different functional gates, LSTM can explore long-range temporal clues with variable-length inputs. As a neural network, the LSTM model can be easily deepened by stacking the hidden states from a layer $l-1$ as inputs of the next layer $l$.

Let us now consider a model of $K$ layers. The feature vector $x_t$ at the $t$-th time step is fed into the first layer of the LSTM together with the hidden state $h^1_{t-1}$ in the same layer obtained from the last time step to produce an updated $h^1_t$, which will then be used as the input of the following layer. Denoting $f_W$ as the mapping function from the inputs to the hidden states, the transition from layer $l-1$ to layer $l$ can be written as:

$$
h_t^l = \begin{cases} f_W(h_t^{l-1}, h_{t-1}^{l}), & l > 1, \\ f_W(x_t, h_{t-1}^{l}), & l = 1. \end{cases}
$$

In order to obtain the prediction scores for a total of $C$ classes at a time step $t$, the outputs from the last layer of the LSTM are sent to a softmax layer estimating probabilities as:

$$
\mathrm{prob}_c = \frac{\exp(w_c^T h_t^K + b_c)}{\sum_{c' \in C} \exp(w_{c'}^T h_t^K + b_{c'})},
$$

where $\mathrm{prob}_c$, $w_c$ and $b_c$ are respectively the probability prediction, the corresponding weight vector and the bias term of the $c$-th class. Such an LSTM network can be trained with the Back-Propagation Through Time (BPTT) algorithm, which “unrolls” the model into a feed-forward neural net and back-propagates to determine the optimal weights.
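The softmax layer on top of the last LSTM layer can be sketched as follows. The function name and the toy weights are assumptions; subtracting the maximum logit is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math


def softmax_scores(h, weights, biases):
    """Class probabilities from a last-layer hidden state h.

    `weights` holds one weight vector per class and `biases` one bias per
    class, mirroring w_c and b_c in the softmax equation.
    """
    logits = [sum(w * x for w, x in zip(w_c, h)) + b_c
              for w_c, b_c in zip(weights, biases)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with a 2-dimensional hidden state and 3 classes.
probs = softmax_scores([1.0, 2.0],
                       weights=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                       biases=[0.0, 0.0, 0.0])
print(probs)
```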

As shown in Figure 1, we adopt two LSTM models for temporal modeling. With the two-stream CNN pipeline for feature extraction, we have a spatial feature set $(x^s_1, x^s_2, \ldots, x^s_T)$ and a motion feature set $(x^m_1, x^m_2, \ldots, x^m_T)$. The learning process leads to a set of predictions $(y^s_1, y^s_2, \ldots, y^s_T)$ for the spatial part and another set $(y^m_1, y^m_2, \ldots, y^m_T)$ for the motion part. For both LSTM models, we adopt the last step output $y_T$ as the video-level prediction scores, since the outputs at the last time step are based on the consideration of the entire sequence. We empirically observe that the last step outputs are better than pooling predictions from all the time steps.

3.3 Regularized Feature Fusion Network 

Given both the spatial and the motion features, it is easy to understand that correlations may exist between them since both are computed on the same video (e.g., person-related static visual features and body motions). A good feature fusion method is supposed to be able to take advantage of the correlations while also maintaining the unique characteristics of each feature to produce a better fused representation. In order to explore this important problem, rather than using simple late fusion, we employ a regularized feature fusion neural network, as shown in the middle part of Figure 1. First, average pooling is adopted to aggregate the frame-level CNN features into video-level representations, which are used as the inputs of the fusion network. The input features are non-linearly mapped to another layer and then fused in a feature fusion layer, where we apply regularizations in the learning of the network weights.


Denote $N$ as the total number of training videos with both the spatial and the motion representations. The $n$-th sample can be written as a 3-tuple $(x^s_n, x^m_n, y_n)$, where $x^s_n \in \mathbb{R}^{d_s}$ and $x^m_n \in \mathbb{R}^{d_m}$ represent the spatial and motion features averaged over the frame-level features $x^s_{n,t}$ and $x^m_{n,t}$ respectively, and $y_n$ is the corresponding label of the $n$-th sample.
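As a minimal sketch of how such a 3-tuple could be assembled, the snippet below averages frame-level feature vectors into video-level ones. The function name, the toy feature values and the integer label are hypothetical.

```python
def average_pool(frame_features):
    """Average T frame-level feature vectors into one video-level vector."""
    T = len(frame_features)
    return [sum(dim) / T for dim in zip(*frame_features)]


# Toy frame-level features: d_s = 2 for the spatial stream, d_m = 3 for motion.
spatial_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
motion_frames = [[0.0, 1.0, 0.0], [2.0, 1.0, 4.0]]

# The training sample as a 3-tuple; the label 7 is an arbitrary class index.
sample = (average_pool(spatial_frames), average_pool(motion_frames), 7)
print(sample)
```

The two streams may have different numbers of pooled inputs, since each is averaged independently before fusion.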

For ease of discussion, let us first consider a degenerate case where only one feature is available. Denote $g(\cdot)$ as the mapping of the neural network from inputs to outputs. The objective of network training is to minimize the following empirical loss:

$$
\min_W \sum_{i=1}^{N} \|g(x_i) - y_i\|^2 + \lambda_1 \Phi(W), \qquad (1)
$$

where the first term measures the discrepancy between the output $g(x_i)$ and the ground-truth label $y_i$, and the second term is usually a Frobenius norm based regularizer to prevent over-fitting.

We now move on to discuss the case of fusion and prediction with two features. Note that the approach can be easily extended to support more than two input features. Specifically, we use a fusion layer (see Figure 1) to absorb the spatial and motion features into a fused representation. To exploit the correlations of the features, we regularize the fusion process with a structural $\ell_{21}$ norm, which is defined as $\|W\|_{2,1} = \sum_i \sqrt{\sum_j w_{ij}^2}$. Incorporating the $\ell_{21}$ norm in the standard deep neural network formulation, we arrive at the following optimization problem:

$$
\min_W \mathcal{L} + \lambda_1 \Phi(W) + \frac{\lambda_2}{2} \|W^E\|_{2,1}, \qquad (2)
$$

where $\mathcal{L} = \sum_{i=1}^{N} \|g(x^s_i, x^m_i) - y_i\|^2$, $W^E = [W^E_s, W^E_m] \in \mathbb{R}^{P \times D}$ denotes the concatenated weight matrix from the $E$-th layer (i.e., the last layer of feature abstraction in Figure 1), $D = d_s + d_m$ and $P$ is the dimension of the fusion layer.

Different from the objective in Equation (1), here we have an additional $\ell_{21}$ norm that is used for exploring feature correlations in the $E$-th layer. The term $\|W^E\|_{2,1}$ computes the 2-norm of the weight values across different features in each dimension. Therefore, the regularization attains its minimum when $W^E$ contains the smallest number of non-zero rows, which carry the discriminative information shared by distinct features. That is to say, the $\ell_{21}$ norm encourages the matrix $W^E$ to be row sparse, which leads to similar zero/nonzero patterns of the columns of the matrix $W^E$. Hence it enforces different features to share a subset of hidden neurons, reflecting the feature correlations.
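A small numerical sketch helps to see why this norm favors row sparsity. The function name and the two toy matrices are assumptions chosen for illustration; the norm itself is the sum over rows of each row's Euclidean norm.

```python
import math


def l21_norm(W):
    """Sum over rows of the Euclidean (2-)norm of each row."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


# Both matrices hold the same two non-zero entries (3 and 4).
row_sparse = [[3.0, 4.0],
              [0.0, 0.0]]   # non-zeros share one row
scattered = [[3.0, 0.0],
             [0.0, 4.0]]    # non-zeros spread over two rows

print(l21_norm(row_sparse), l21_norm(scattered))
```

The row-sparse layout has the smaller norm (5 versus 7), so minimizing the regularizer pushes non-zero weights of different features into the same rows.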

However, in addition to seeking the correlations shared among features, the unique discriminative information should also be preserved so that the complementary information can be used for improved classification performance. Thus, we add an additional regularizer to Equation (2) as follows:

$$
\min_W \mathcal{L} + \lambda_1 \Phi(W) + \frac{\lambda_2}{2} \|W^E\|_{2,1} + \lambda_3 \|W^E\|_{1,1}. \qquad (3)
$$

The term $\|W^E\|_{1,1}$ can be regarded as a complement of the $\|W^E\|_{2,1}$ norm. It makes the $\ell_{21}$ norm robust against sharing incorrect features among different representations, and thus allows different representations to emphasize different hidden neurons.

Although the regularizer terms in Equation (3) are all convex functions, the optimization problem in Equation (3) is nonconvex due to the nonconvexity of the sigmoid function. Below, we discuss the optimization strategy using the gradient descent method in two cases:

1. For the $E$-th layer, our objective function has four valid terms: the empirical loss, the $\ell_2$ regularizer $\Phi(W)$, and two nonsmooth structural regularizers, i.e., the $\ell_{21}$ and $\ell_{11}$ terms. Note that simply using the gradient descent method is not optimal due to the two nonsmooth terms. We propose to optimize the $E$-th layer using a proximal gradient descent method, which splits the objective function into two parts:

$$
p = \mathcal{L} + \lambda_1 \Phi(W), \qquad q = \frac{\lambda_2}{2} \|W^E\|_{2,1} + \lambda_3 \|W^E\|_{1,1},
$$

where $p$ is a smooth function and $q$ is a nonsmooth function. Thus, the update of the $z$-th iteration is formulated as:

$$
(W^E)^{(z)} = \mathrm{Prox}_q\big((W^E)^{(z-1)} - \nabla p((W^E)^{(z-1)})\big),
$$

where Prox is a proximal operator defined as:

$$
\mathrm{Prox}_q(W) = \arg\min_V \|W - V\| + q(V).
$$

The proximal operator on the combination of the $\ell_{21}/\ell_{11}$ norm ball can be computed analytically (Equation 4), where $U_{r\cdot} = [|V_{r\cdot}| - \lambda_3]_+ \,\mathrm{sign}[V_{r\cdot}]$, and $W_{r\cdot}$, $U_{r\cdot}$, $V_{r\cdot}$ denote the $r$-th row of matrix $W$, $U$ and $V$, respectively. Readers may refer to the literature for a detailed proof of a similar analytical solution.

2. For all the other layers, the objective function in Equation (3) only contains the first two valid terms, which are both smooth. Thus, we can directly apply the gradient descent method: given the gradient with respect to $W^l$, the weight matrix of the $l$-th layer is updated by a standard gradient descent step (Equation 5).
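The two-case update can be illustrated with a sketch of a row-wise proximal step. This is not the paper's exact Equation (4): it assumes the standard definition of the proximal operator, $\arg\min_W \frac{1}{2}\|W - V\|_F^2 + q(W)$, for which the solution is element-wise soft-thresholding followed by a shrinkage of each row. The function name and the threshold constants that follow from this assumed definition are illustrative.

```python
import math


def prox_l21_l11(V, lam2, lam3):
    """Row-wise prox of (lam2 / 2) * ||W||_{2,1} + lam3 * ||W||_{1,1}.

    Assumes the standard definition argmin_W 0.5 * ||W - V||_F^2 + q(W);
    the constants may differ from those used in the paper.
    """
    W = []
    for v_row in V:
        # Element-wise soft-thresholding: U_r = [|V_r| - lam3]_+ sign(V_r).
        u_row = [math.copysign(max(abs(v) - lam3, 0.0), v) for v in v_row]
        norm = math.sqrt(sum(u * u for u in u_row))
        # Shrink the whole row; rows with a small norm are set to zero.
        scale = max(1.0 - (lam2 / 2.0) / norm, 0.0) if norm > 0.0 else 0.0
        W.append([scale * u for u in u_row])
    return W


# V plays the role of the fusion-layer weights after a plain gradient step.
V = [[3.0, -4.0],
     [0.2, -0.1]]
W = prox_l21_l11(V, lam2=2.0, lam3=0.5)
print(W)
```

The second row, whose entries all fall below the threshold, is zeroed out entirely, while the first row is shrunk but keeps its signs. This is the row-sparsity effect that the structural regularizers are meant to produce.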

The only additional computation cost for training our regularized feature fusion network is to compute the proximal operator in the $E$-th layer. The complexity of the analytical solution in Equation (4) is $O(P \times D)$. Therefore, the proposed proximal gradient descent method can quickly train the network with affordable computational cost. When incorporating more features, our formulation can be computed efficiently with linearly increased cost, while cubic operations are required by the approach of [47] to reach a similar goal. In sum, the above optimization is incorporated into the conventional back propagation procedure, as described in Algorithm 1.

3.4 Discussion 

The approach described above has the capability of modeling static spatial, short-term motion and long-term temporal clues, which are all very important in video content analysis. One may have noticed that the proposed hybrid deep learning framework contains several components that are separately trained. Joint training is feasible but not adopted in the current framework for the following reason. The joint training process is more complex, and existing works exploring this aspect indicate that the performance gain is not very significant. For example, jointly training the LSTM with a CNN for feature extraction was reported to improve the performance on a benchmark dataset only from 70.5% to 71.1%. Besides, an advantage of separate training is that the framework is more flexible, where a component can be replaced easily without the need of re-training the entire framework. For instance, more discriminative CNN models like the GoogLeNet and deeper RNN models can be used to replace the CNN and LSTM parts respectively.

Algorithm 1: The training procedure of the regularized feature fusion network.

Input: $x^s_n$ and $x^m_n$: the spatial and motion CNN features of the $n$-th video sample;
       $y_n$: the label of the $n$-th video sample;
       randomly initialized weight matrices $W$;
begin
    for epoch ← 1 to M do
        Get the prediction error with feedforward propagation;
        for l ← L to 1 do
            Evaluate the gradients and update the weight matrices using Equation (5);
            if l == E then
                Evaluate the proximal operator according to Equation (4);
            end
        end
    end
end

In addition, as mentioned in Section 1, there could be 
alternative frameworks or models with similar capabilities. 
The main contribution of this work is to show that such a 
hybrid framework is very suitable for video classification. In 
addition to showing the effectiveness of the LSTM and the 
regularized feature fusion network, we also show that the 
combination of both in the hybrid framework can lead to 
significant improvements, particularly for long videos that 
contain rich temporal clues. 

4. EXPERIMENTS

4.1 Experimental Setup 

4.1.1 Datasets 

We adopt two popular datasets to evaluate the proposed 
hybrid deep learning framework. 

UCF-101 [^. The UCF-101 dataset is one of the most
popular action recognition benchmarks. It consists of 13,320
video clips of 101 human actions (27 hours in total). The 101
classes are divided into five groups: Body-Motion,
Human-Human Interactions, Human-Object Interactions, Playing
Musical Instruments and Sports. Following [^, we conduct
evaluations using 3 train/test splits, which is currently the
most popular setting for this dataset. Results are measured
by classification accuracy on each split, and we report
the mean accuracy over the three splits.

Columbia Consumer Videos (CCV) [^. The CCV
dataset contains 9,317 YouTube videos annotated according
to 20 classes, which are mainly events like “basketball”,
“graduation ceremony”, “birthday party” and “parade”. We
follow the convention defined in to use a training set of
4,659 videos and a test set of 4,658 videos. The performance
is evaluated by average precision (AP) for each class, and we
report the mean AP (mAP) as the overall measure.
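
The two evaluation measures can be sketched in a few lines of Python. The sketch assumes the common non-interpolated definition of average precision (the mean of the precision values at the ranks of the positive samples); the exact AP variant used in the CCV protocol is not specified here, and all function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean of precision@k over the
    ranks k at which a positive sample (label 1) is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class_scores, per_class_labels):
    """mAP: the unweighted mean of the per-class AP values (CCV measure)."""
    aps = [average_precision(s, l)
           for s, l in zip(per_class_scores, per_class_labels)]
    return sum(aps) / len(aps)


def mean_accuracy(split_accuracies):
    """Mean classification accuracy over the train/test splits (UCF-101 measure)."""
    return sum(split_accuracies) / len(split_accuracies)


if __name__ == "__main__":
    # Toy example: four test videos scored for a single class.
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```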


4.1.2 Implementation Details 

For the feature extraction networks, we adopt the
VGG_19 and the CNN_M to extract the spatial and
the motion CNN features, respectively. The two networks
achieve 7.5% [34] and 13.5% top-5 error rates on
the ImageNet ILSVRC-2012 validation set, respectively. The
spatial CNN is first pre-trained on the ILSVRC-2012 training
set with 1.2 million images and then fine-tuned using the
video data, which is observed to be better than training from
scratch. The motion CNN is trained from scratch as there
is no off-the-shelf training set in the required form. In addition,
simple data augmentation methods like cropping and
flipping are utilized following [33].

The CNN models are trained using mini-batch stochastic
gradient descent with a momentum fixed to 0.9. In the
fine-tuning case of the spatial CNN, the rate starts from
10“^ and decreases to 10“^ after 14K iterations, then to 


10after 20K iterations. This setting is similar to 33 , but 
we start from a smaller rate of 10“^ instead of 10^^ For 
the motion CNN, we set the learning rate to 10~^ initially,
and reduce it to 10“^ after 100K iterations, then to 10“^
after 200K iterations. Our implementation is based on the
publicly available Caffe toolbox with modifications to
support parallel training with multiple GPUs in a server.
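
The step-wise learning rate schedules described above can be written as a small piecewise-constant function. In the Python sketch below, only the iteration boundaries (14K/20K for the spatial CNN, 100K/200K for the motion CNN) are taken from the text. The base rate of 1.0 and the decay factor of 0.1 are placeholder assumptions for illustration, and the function name is hypothetical.

```python
def step_learning_rate(iteration, base_rate, boundaries, factor=0.1):
    """Piecewise-constant schedule: the rate is multiplied by `factor`
    once for every boundary that `iteration` has reached."""
    drops = sum(1 for b in boundaries if iteration >= b)
    return base_rate * (factor ** drops)


# Iteration boundaries follow the text; base rates are placeholders.
SPATIAL_BOUNDARIES = (14_000, 20_000)
MOTION_BOUNDARIES = (100_000, 200_000)

if __name__ == "__main__":
    for it in (0, 14_000, 20_000):
        rate = step_learning_rate(it, base_rate=1.0, boundaries=SPATIAL_BOUNDARIES)
        print(it, rate)
```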

For temporal modeling, we adopt two layers in the LSTM
for both the spatial and the motion features. Each LSTM
has 1,024 hidden units in the bottom layer and 512 hidden
units in the other layer. The network weights are learnt using
a parallel implementation of the BPTT algorithm with a
mini-batch size of 10. In addition, the learning rate and momentum
are set to 10“^ and 0.9, respectively. The training
is stopped after 150K iterations for both datasets.

For the regularized feature fusion network, we use four
layers of neurons as illustrated in the middle of Figure
Specifically, we first use one layer with 200 neurons for the
spatial and the motion features to perform feature abstraction
separately, and then one layer with 200 neurons for feature
fusion with the proposed regularized structural norms. Finally,
the fused features are used to build a logistic regression
model in the last layer for video classification. We set the
learning rate to 0.7 and fix λ_1 to 3 x 10“^ in order to prevent
over-fitting. In addition, we tune λ_2 and λ_3 in the same
range as λ_1 using cross-validation.


4.1.3 Compared Approaches 

To validate the effectiveness of our approach, we compare
with the following baseline or alternative methods: (1)
Two-stream CNN. Our implementation produces similar
overall results to the original work [^. We also report the
results of the individual spatial-stream and motion-stream
CNN models, namely Spatial CNN and Motion CNN,
respectively; (2) Spatial LSTM, which refers to the LSTM
trained with the spatial CNN features; (3) Motion LSTM,
the LSTM trained with the motion CNN features; (4) SVM-based
Early Fusion (SVM-EF). A χ²-kernel is computed
for each video-level CNN feature and then the two kernels
are averaged for classification; (5) SVM-based Late Fusion
(SVM-LF). Separate SVM classifiers are trained for
each video-level CNN feature and the prediction outputs are
averaged; (6) Multiple Kernel Learning (SVM-MKL),
which combines the two features with the ℓp-norm MKL
by fixing p = 2; (7) Early Fusion with Neural Networks
(NN-EF), which concatenates the two features into
a long vector and then uses a neural network for classification;
(8) Late Fusion with Neural Networks (NN-LF),
which deploys a separate neural network for each feature
and then uses the average output scores as the final prediction;
(9) Multimodal Deep Boltzmann Machines
(M-DBM) [28, 37], where feature fusion is performed using
a neural network in a free manner without regularizations;
(10) RDNN [^, which also imposes regularizations in a
neural network for feature fusion, using a formulation that
has a much higher complexity than our approach.

The first three methods are part of the proposed framework
and are evaluated as baselines to better understand
the contribution of each individual component. The last
seven methods focus on fusing the spatial and the motion
features (outputs of the first fully-connected layer of the
CNN models) for improved classification performance. We
compare our regularized fusion network with all seven
methods.
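
The difference between the kernel-level, feature-level and score-level fusion baselines listed above can be made concrete with a short Python sketch. The exponential form of the χ² kernel, the `gamma` parameter and all function names are assumptions for illustration; the classifiers themselves (SVM or neural network) are omitted.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative vectors
    (one common form; the exact variant is an assumption)."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)


def averaged_kernel(spatial_a, motion_a, spatial_b, motion_b):
    """Kernel-level fusion in the style of SVM-EF: one kernel per
    feature type, then the two kernel values are averaged."""
    return 0.5 * (chi2_kernel(spatial_a, spatial_b)
                  + chi2_kernel(motion_a, motion_b))


def concatenate_features(spatial, motion):
    """Feature-level early fusion in the style of NN-EF."""
    return list(spatial) + list(motion)


def late_fusion(spatial_scores, motion_scores):
    """Score-level late fusion in the style of SVM-LF and NN-LF:
    the per-class prediction scores of the two models are averaged."""
    return [(s + m) / 2.0 for s, m in zip(spatial_scores, motion_scores)]


if __name__ == "__main__":
    print(averaged_kernel([0.2, 0.8], [0.5, 0.5], [0.3, 0.7], [0.4, 0.6]))
    print(late_fusion([0.2, 0.8], [0.6, 0.4]))
```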

4.2 Results and Discussions 

4.2.1 Temporal Modeling 

We first evaluate the LSTM to investigate the significance
of leveraging the long-term temporal clues for video classification.
The results and comparisons are summarized in
Table 1. The upper two groups in the table compare the
LSTM models with the two-stream CNN, which performs
classification by pooling video-level representations without
considering the temporal order of the frames. On UCF-101,
the Spatial LSTM is better than the spatial stream CNN,
while the result of the Motion LSTM is slightly lower than
that of the motion stream CNN. It is interesting to see that,
on the spatial stream, the LSTM is even better than the
state-of-the-art CNN, indicating that the temporal information,
which is fully discarded in the spatial stream CNN,
is very important for human action modeling. Since the mechanism
of the LSTM is totally different, these results are fairly
appealing because the LSTM is potentially very complementary to
the video-level classification based on feature pooling.

On the CCV dataset, the LSTM models produce lower
performance than the CNN models on both streams. The
reasons are two-fold. First, since the average duration of the
CCV videos (80 seconds) is around 10 times longer than that
of the UCF-101 videos and the contents in CCV are more complex
and noisy, the LSTM might be affected by the noisy video
segments that are irrelevant to the major class of a video.
Second, some classes like “wedding reception” and “beach” do
not contain clear temporal order information (see Figure),
for which the LSTM can hardly capture helpful clues.

We now assess the performance of combining the LSTM
and the CNN models to study whether they are complementary,
by fusing the outputs of the two types of models trained
on the two streams. Note that the fusion method adopted
here is the simple late average fusion, which uses the average
prediction scores of different models. More advanced fusion
methods will be evaluated in the next subsection.





                                  UCF-101    CCV
-------------------------------------------------
Spatial CNN                       80.1%      75.0%
Spatial LSTM                      83.3%      43.3%
-------------------------------------------------
Motion CNN                        77.5%      58.9%
Motion LSTM                       76.6%      54.7%
-------------------------------------------------
CNN + LSTM (Spatial)              84.0%      77.9%
CNN + LSTM (Motion)               81.4%      70.9%
CNN + LSTM (Spatial & Motion)     90.1%      81.7%

Table 1: Performance of the LSTM and the CNN
models trained with the spatial and the short-term
motion features respectively on UCF-101 and CCV.
“+” indicates model fusion, which simply uses the
average prediction scores of different models.



                                  UCF-101    CCV
-------------------------------------------------
Spatial SVM                       78.6%      74.4%
Motion SVM                        78.2%      57.9%
-------------------------------------------------
SVM-EF                            86.6%      75.3%
SVM-LF                            85.3%      74.9%
SVM-MKL                           86.8%      75.4%
-------------------------------------------------
NN-EF                             86.5%      75.6%
NN-LF                             85.1%      75.2%
M-DBM                             86.9%      75.3%
Two-stream CNN                    86.2%      75.8%
RDNN                              88.1%      75.9%
-------------------------------------------------
Non-regularized Fusion Network    87.0%      75.4%
Regularized Fusion Network        88.4%      76.2%

Table 2: Performance comparison on UCF-101 and
CCV, using various fusion approaches to combine
the spatial and the short-term motion clues.


Results are reported in the bottom three rows of Table 1.
We observe significant improvements from model fusion on
both datasets. On UCF-101, the fusion leads to an absolute
performance gain of around 1% compared with the best single
model for the spatial stream, and a gain of 4% for the
motion stream. On CCV, the improvements are more significant,
especially for the motion stream where an absolute
gain of 12% is observed. These results confirm that
the long-term temporal clues are highly complementary to
the spatial and the short-term motion features. In addition,
the fusion of all the CNN and the LSTM models trained
on the two streams attains the highest performance on both
datasets: 90.1% and 81.7% on UCF-101 and CCV, respectively,
showing that the spatial and the short-term motion
features are also very complementary. Therefore, it is important
to incorporate all of them into a successful video
classification system.
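
The gains quoted above follow directly from Table 1. The short Python check below recomputes them as the difference between the fused result and the best single model of the same stream; the dictionary layout and function name are ours, and the numbers are copied from the table. The exact differences are 0.7 and 3.9 points on UCF-101 and 2.9 and 12.0 points on CCV, which the text rounds to "around 1%", "4%" and "12%".

```python
# Accuracy (UCF-101) and mAP (CCV) values copied from Table 1, in percent.
TABLE_1 = {
    "UCF-101": {
        "spatial": {"CNN": 80.1, "LSTM": 83.3, "fused": 84.0},
        "motion": {"CNN": 77.5, "LSTM": 76.6, "fused": 81.4},
    },
    "CCV": {
        "spatial": {"CNN": 75.0, "LSTM": 43.3, "fused": 77.9},
        "motion": {"CNN": 58.9, "LSTM": 54.7, "fused": 70.9},
    },
}


def fusion_gain(row):
    """Absolute gain of the fused result over the best single model."""
    return round(row["fused"] - max(row["CNN"], row["LSTM"]), 1)


if __name__ == "__main__":
    for dataset, streams in TABLE_1.items():
        for stream, row in streams.items():
            print(dataset, stream, fusion_gain(row))
```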

4.2.2 Feature Fusion 

Next, we compare our regularized feature fusion network
with the alternative fusion methods, using both the spatial
and the motion CNN features. The results are presented
in Table 2, which is divided into four groups. The
first group reports the performance of individual features extracted
from the first fully connected layer of the CNN models,
classified by SVM classifiers. This is reported to study
the gain from the SVM based fusion methods, as shown in
the second group of results. The individual feature results
using the CNN, i.e., the Spatial CNN and the Motion CNN,
are already reported in Table 1. The third group of results
in Table 2 is based on the alternative neural network fusion
methods. Note that NN-EF and NN-LF take the features
from the CNN models and perform fusion and classification
using separate neural networks, while the two-stream
CNN approach performs classification using the CNN directly
with a late score fusion (Figure 2). Finally, the last
group contains the results of the proposed fusion network.

As can be seen, the SVM based fusion methods can greatly
improve the results on UCF-101. On CCV, the gain is consistent
but not very significant, indicating that the short-term
motion is more important for modeling the human actions
with clearer motion patterns and less noise. SVM-MKL
is only slightly better than the simple early and late
fusion methods, which is consistent with observations in recent
works on visual recognition [42].

Our proposed regularized feature fusion network (the last
row in Table 2) is consistently better than the alternative
neural network based fusion methods shown in the third
group of the table. In particular, the gap between our results
and those of the M-DBM and the two-stream CNN confirms
that using regularizations in the fusion process is helpful.
Compared with the RDNN, our formulation produces
slightly better results but with a much lower complexity as
discussed earlier.

In addition, as the proposed formulation contains two
structural norms, to directly evaluate the contribution of
the norms, we also report a baseline using the same network
structure without regularization (“non-regularized fusion
network” in the table), which is similar to the M-DBM
approach in its design but differs slightly in network structures.
We see that adding regularizations in the same network
improves the results by 1.4% on UCF-101 and 0.8% on CCV.

4.2.3 The Hybrid Framework 

Finally, we discuss the results of the entire hybrid deep
learning framework by further combining results from the
temporal LSTM and the regularized fusion network. The
prediction scores from these networks are fused linearly with
weights estimated by cross-validation. As shown in the last
row of Table 3, we achieve very strong performance on both
datasets: 91.3% on UCF-101 and 83.5% on CCV. The performance
of the hybrid framework is clearly better than
that of the Spatial LSTM and the Motion LSTM (in Table 1).
Compared with the Regularized Fusion Network (in
Table 2), adding the long-term temporal modeling in the
hybrid framework improves the results by 2.9% on UCF-101
and 7.3% on CCV, which are fairly significant considering
the difficulties of the two datasets. In contrast to the fusion
result in the last row of Table 1, the gain of the proposed
hybrid framework comes from the use of the regularized fusion,
which again verifies the effectiveness of our fusion method.
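
The linear score fusion with cross-validated weights can be sketched as follows. This is a minimal Python illustration for two models with weights (w, 1 - w) chosen by a grid search on held-out data. The grid resolution, the use of accuracy as the selection criterion and all function names are assumptions, not details given in the text.

```python
def fuse_scores(score_sets, weights):
    """Weighted linear fusion of the per-class scores that several
    models produce for one video."""
    n_classes = len(score_sets[0])
    return [sum(w * s[c] for w, s in zip(weights, score_sets))
            for c in range(n_classes)]


def accuracy(score_sets_per_video, labels, weights):
    """Fraction of videos whose top fused score matches the label."""
    correct = 0
    for score_sets, label in zip(score_sets_per_video, labels):
        fused = fuse_scores(score_sets, weights)
        predicted = max(range(len(fused)), key=fused.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


def search_weight(score_sets_per_video, labels, steps=10):
    """Grid search over w in [0, 1] on validation data; the first model
    receives weight w and the second receives 1 - w."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda w: accuracy(score_sets_per_video, labels, (w, 1.0 - w)))


if __name__ == "__main__":
    # Two validation videos, two classes, scores from two models.
    validation = [([0.9, 0.1], [0.4, 0.6]), ([0.2, 0.8], [0.7, 0.3])]
    labels = [0, 1]
    w = search_weight(validation, labels)
    print(w, accuracy(validation, labels, (w, 1.0 - w)))
```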

To better understand the contributions of the key components
in the hybrid framework, we further report the per-class
performance on CCV in Figure 4. We see that, although
the performance of the LSTM is clearly lower, fusing
it with the video-level predictions by the regularized fusion
network can significantly improve the results for almost all
the classes. This is a bit surprising because some classes
do not seem to require temporal information to be recognized.
After checking some of the videos, we find that
there could be helpful clues which can be modeled, even for
object-related classes like “cat” and “dog”. For instance, as
shown in Figure 5, we observe that quite a number of “cat”
videos contain only a single cat running around on the floor.
The LSTM network may be able to capture this clue, which
is helpful for classification.

Figure 4: Per-class performance on CCV, using the Spatial and Motion LSTM, the Regularized Fusion
Network, and their combination, i.e., the Hybrid Deep Learning Framework.

Figure 5: Two example videos of class “cat” in the
CCV dataset with similar temporal clues over time.

Efficiency. In addition to achieving superior classification
performance, our framework also enjoys high computational
efficiency. We summarize the average testing time
of a UCF-101 video (duration: 8 seconds) as follows. The extraction
of the frames and the optical flows takes 3.9 seconds,
and computing the CNN-based spatial and short-term motion
features requires 9 seconds. Prediction with the LSTM
and the regularized fusion network needs 2.8 seconds. All
these are evaluated on a single NVIDIA Tesla K40 GPU.
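
Summing the three stages gives the end-to-end cost per video. The short Python check below uses only the timings quoted above (the stage names are ours): the total is about 15.7 seconds, i.e. roughly twice the duration of the 8-second clip.

```python
# Per-video timings (seconds) quoted in the text for an 8-second UCF-101 clip.
STAGE_SECONDS = {
    "frame_and_optical_flow_extraction": 3.9,
    "cnn_feature_computation": 9.0,
    "lstm_and_fusion_prediction": 2.8,
}
VIDEO_DURATION_SECONDS = 8.0

total_seconds = round(sum(STAGE_SECONDS.values()), 1)
slowdown = total_seconds / VIDEO_DURATION_SECONDS

if __name__ == "__main__":
    print(total_seconds, round(slowdown, 2))
```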

4.2.4 Comparison with the State of the Art

In this subsection, we compare with several state-of-the-art
results. As shown in Table 3, our hybrid deep learning
framework produces the highest performance on both
datasets. On the UCF-101, many works with competitive
results are based on the dense trajectories [^, while
our approach fully relies on the deep learning techniques.
Compared with the original result of the two-stream CNN
in [^, our framework is better with the additional fusion
and temporal modeling functions, although it is built on our
implementation of the CNN models which are slightly worse
than that of (our two-stream CNN result is 86.2%).
Note that a gain of even just 1% on the widely adopted UCF-101
dataset is generally considered significant progress.


UCF-101                              CCV
------------------------------------------------------------
Donahue et al. [6]        82.9%      Xu et al. [48]    60.3%
Srivastava et al. [36]    84.3%      Ye et al. [50]    64.0%
Wang et al.               85.9%      Jhuo et al.       64.0%
Tran et al. [40]          86.7%      Ma et al. [27]    63.4%
Simonyan et al. [33]      88.0%      Liu et al. [25]   68.2%
Lan et al. [22]           89.1%      Wu et al. [47]    70.6%
Zha et al.                89.6%
------------------------------------------------------------
Ours                      91.3%      Ours              83.5%

Table 3: Comparison with state-of-the-art results.
Our approach produces to-date the best reported
results on both datasets.


In addition, the recent works in Im also adopted the 
LSTM to explore the temporal clues for video classification 
and reported promising performance. However, our LSTM 
results are not directly comparable as the input features are 
extracted by different neural networks. 

On the CCV dataset, all the recent approaches relied on
the joint use of multiple features by developing new fusion
methods [50, 12, 27, 25]. Our hybrid deep learning
framework is significantly better than all of them.


5. CONCLUSIONS 

We have proposed a novel hybrid deep learning framework
for video classification, which is able to model static visual
features, short-term motion patterns and long-term temporal
clues. In the framework, we first extract spatial and
motion features with two CNNs trained on static frames
and stacked optical flows, respectively. The two types of features
are used separately as inputs of the LSTM network
for long-term temporal modeling. A regularized fusion network
is also deployed to combine the two features at the video
level for improved classification. Our hybrid deep learning
framework integrating both the LSTM and the regularized
fusion network produces very impressive performance on two
widely adopted benchmark datasets. The results not only verify
the effectiveness of the individual components of the framework,
but also demonstrate that the frame-level temporal
modeling and the video-level fusion and classification are
highly complementary, and a big leap in performance can
be attained by combining them.

Although deep learning based approaches have been successful
in addressing many problems, effective network architectures
are urgently needed for modeling sequential data
like videos. Several researchers have recently explored
this direction. However, compared with the progress on image
classification, the achieved performance gain on video
classification over the traditional hand-crafted features is
much less significant. Our work in this paper represents
one of the few studies showing very strong results. For future
work, further improving the capability of modeling the
temporal dimension of videos is of high priority. In addition,
audio features, which are known to be useful for video
classification, can be easily incorporated into our framework.